Abstract. We define a class of Euclidean distances on weighted graphs, which makes thermodynamic soft graph clustering possible. The class can be constructed from the "raw coordinates" encountered in spectral clustering, and can be extended by means of higher-dimensional embeddings (Schoenberg transformations). Geographical flow data, properly conditioned, illustrate the procedure as well as visualization aspects.
1 Introduction

In a nutshell (see e.g. Shi and Malik (2000); Ng, Jordan and Weiss (2002); von Luxburg (2007) for a review), spectral graph clustering consists in

A) constructing a features-based similarity or affinity matrix between n objects

B) performing the spectral decomposition of the normalized affinity matrix, and representing the objects by the corresponding eigenvectors or raw coordinates

C) applying a clustering algorithm on the raw coordinates.

The present contribution focuses on (C) thermodynamic clustering (Rose et al. 1990; Bavaud 2009), an aggregation-invariant soft K-means clustering based upon Euclidean distances between objects. The latter constitute distances on weighted graphs, and are constructed from the raw coordinates (B), whose form happens to be justified from presumably new considerations on equivalence between vertices (Section 3.3). Geographical flow data illustrate the theory (Section 4). Once properly symmetrized, endowed with a sensible diagonal and normalized, flows define an exchange matrix (Section 2), that is an affinity matrix (A) which might be positive definite or not.

A particular emphasis is devoted to the definition of Euclidean distances on weighted graphs and their properties (Section 3). For instance, diffusive and chi-square distances are focused, that is zero between equivalent vertices. Commute-time and absorption distances are not focused, but their values between equivalent vertices possess a universal character. All these distances, whose relationship to the shortest-path distance on weighted graphs is partly elucidated, differ in the way eigenvalues are used to scale the raw coordinates. Allowing further Schoenberg transformations (Definition 3) of the distances extends the class of admissible distances on graphs still further, by means of a high-dimensional embedding familiar in the Machine Learning community.

2 Preliminaries and notations 

Consider n objects, together with an exchange matrix $E = (e_{ij})$, that is an $n \times n$ non-negative, symmetric matrix, whose components add up to unity (Berger and Snell 1957). E can be obtained by normalizing an affinity or similarity matrix, and defines the normalized adjacency matrix of a weighted undirected graph (containing loops in general), where $e_{ij}$ is the weight of edge $(ij)$ and $f_i = \sum_{j=1}^{n} e_{ij}$ is the relative degree or weight of vertex i, assumed strictly positive.
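As a minimal sketch of this normalization, the following Python fragment turns a non-negative affinity matrix into an exchange matrix and its vertex weights. The function name `exchange_matrix`, the plain arithmetic symmetrization and the toy matrix `W` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def exchange_matrix(W):
    """Symmetrize a non-negative affinity matrix and normalize it to unit total mass.

    Hypothetical helper: the arithmetic symmetrization is only one possible choice.
    """
    W = np.asarray(W, dtype=float)
    S = 0.5 * (W + W.T)        # symmetric affinity
    E = S / S.sum()            # components add up to unity
    f = E.sum(axis=1)          # relative degrees (vertex weights)
    if np.any(f <= 0):
        raise ValueError("every vertex needs a strictly positive weight")
    return E, f

# Toy affinity matrix (assumed data), with loops on the diagonal
W = np.array([[4.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]])
E, f = exchange_matrix(W)
```

The weights `f` then sum to one and play the role of the stationary distribution used in the next subsection.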

2.1 Eigenstructure 

$P = (p_{ij})$ with $p_{ij} = e_{ij}/f_i$ is the transition matrix of a reversible Markov chain, with stationary distribution $f$. The t-step exchange matrix is $E^{(t)} = \Pi P^t$, where $\Pi$ is the diagonal matrix containing the weights $f$. In particular, assuming the chain to be regular (see e.g. Kijima 1997)

$$E^{(0)} = \Pi \qquad E^{(2)} = E\,\Pi^{-1}E \qquad E^{(\infty)} = f f'$$

P is similar to the symmetric, normalized exchange matrix $\Pi^{-1/2} E\, \Pi^{-1/2}$ (see e.g. Chung 1997), and shares the same eigenvalues $1 = \lambda_0 \ge \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_{n-1} \ge -1$. It is well-known that the second eigenvalue $\lambda_1$ attains its maximum value 1 iff the graph contains disconnected components, and $\lambda_{n-1} = -1$ iff the graph is bipartite. We note $U \Lambda U'$ the spectral decomposition of the normalized exchange matrix, where $\Lambda$ is diagonal and contains the eigenvalues, and $U = (u_{i\alpha})$ is orthonormal and contains the normalized eigenvectors. In particular, $u_0 = \sqrt{f}$ is the eigenvector corresponding to the trivial eigenvalue $\lambda_0 = 1$. Also, the spectral decomposition of higher-order exchange matrices reads $\Pi^{-1/2} E^{(t)} \Pi^{-1/2} = U \Lambda^t U'$.
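The eigenstructure above can be sketched numerically as follows. The helper name `eigenstructure` and the toy exchange matrix are assumptions made for illustration; `numpy.linalg.eigh` is used because the normalized exchange matrix is symmetric.

```python
import numpy as np

def eigenstructure(E):
    """Eigenvalues (decreasing), eigenvectors and raw coordinates of an exchange matrix."""
    f = E.sum(axis=1)
    En = E / np.sqrt(np.outer(f, f))       # Pi^{-1/2} E Pi^{-1/2}
    lam, U = np.linalg.eigh(En)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    if U[:, 0].sum() < 0:                  # fix the sign of the trivial eigenvector
        U[:, 0] = -U[:, 0]
    X = U / np.sqrt(f)[:, None]            # raw coordinates x_{i alpha} = u_{i alpha} / sqrt(f_i)
    return f, lam, U, X

# Toy connected exchange matrix (assumed data)
E = np.array([[4.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]]) / 13.0
f, lam, U, X = eigenstructure(E)
```

On a connected graph the first column of `U` reproduces $\sqrt{f}$, so the first raw coordinate is constant and carries no clustering information.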

2.2 Hard and soft partitioning 

A soft partition of the n objects into m groups is specified by an $n \times m$ membership matrix $Z = (z_{ig})$, whose components (obeying $z_{ig} \ge 0$ and $\sum_{g=1}^{m} z_{ig} = 1$) quantify the membership degree of object i in group g. The relative volume of group g is $\rho_g = \sum_i f_i z_{ig}$. The components $\theta_{gh} = \sum_i f_i z_{ig} z_{ih}$ of the $m \times m$ matrix $\Theta = Z' \Pi Z$ measure the overlap between groups g and h. In particular, $\theta_{gg}/\rho_g \le 1$ measures the hardness of group g. The components $a_{gh} = \sum_{ij} e_{ij} z_{ig} z_{jh}$ of the $m \times m$ matrix $A = Z' E Z$ measure the association between groups g and h.

A group g can also be specified by the objects it contains, namely by the distribution $\pi^g$ with components $\pi_i^g = f_i z_{ig}/\rho_g$, obeying $\sum_i \pi_i^g = 1$ by construction. The object-group mutual information

$$I(O, Z) = H(O) + H(Z) - H(O, Z) = -\sum_i f_i \ln f_i - \sum_g \rho_g \ln \rho_g + \sum_{ig} f_i z_{ig} \ln(f_i z_{ig})$$

measures the object-group dependence or cohesiveness (Cover and Thomas 1991).

A partition is hard if each object belongs to a unique group, that is if the memberships are of the form $z_{ig} = I(i \in g)$, or equivalently if $z_{ig}^2 = z_{ig}$ for all i, g, or equivalently if $\theta_{gg} = \rho_g$ for all g, or still equivalently if the overall softness $H(Z|O) = H(Z) - I(O, Z)$ takes on its minimum value of zero.

Also, $H(O) \le \ln n$, with equality iff $f_i = 1/n$, that is if the graph is regular.
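The quantities of this subsection are straightforward to evaluate. The sketch below, with a hypothetical helper `partition_summary` and assumed toy memberships, computes volumes, overlaps, associations, the mutual information and the overall softness.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural logarithm) of a non-negative array, ignoring zeros."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def partition_summary(E, Z):
    f = E.sum(axis=1)
    rho = f @ Z                              # relative volumes rho_g
    Theta = Z.T @ (f[:, None] * Z)           # overlaps theta_gh
    A = Z.T @ E @ Z                          # associations a_gh
    I = entropy(f) + entropy(rho) - entropy(f[:, None] * Z)   # I(O, Z)
    softness = entropy(rho) - I              # H(Z|O)
    return rho, Theta, A, I, softness

E = np.array([[4.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]]) / 13.0
Z_hard = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Z_soft = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
rho_h, Theta_h, A_h, I_h, soft_h = partition_summary(E, Z_hard)
rho_s, Theta_s, A_s, I_s, soft_s = partition_summary(E, Z_soft)
```

For the hard membership the softness vanishes and the diagonal of the overlap matrix equals the volumes, as stated above; for the soft one the softness is strictly positive.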



2.3 Spectral versus soft membership relaxation 

In their presentation of the Ncut-driven spectral clustering, Yu and Shi (2003) (see also Nock et al. 2009) determine the hard $n \times m$ membership Z maximizing

$$\epsilon[Z] = \sum_{g=1}^{m} \frac{a_{gg}}{\rho_g} = \sum_{g=1}^{m} \frac{a[Z]_{gg}}{\theta[Z]_{gg}} = \mathrm{tr}(X' E X) \qquad \text{where } X[Z] = Z\, \Theta^{-1/2}[Z]$$

under the constraint $X' \Pi X = I$. Relaxing the hardness and non-negativity conditions, they show the solution to be $\epsilon[Z_0] = 1 + \sum_{\alpha=1}^{m-1} \lambda_\alpha$, attained with an optimal "membership" of the form $Z_0 = X_0 R\, \Theta^{1/2}$, where R is any orthonormal $m \times m$ matrix and $X_0 = (1, x_1, \ldots, x_\alpha, \ldots, x_{m-1})$ is the $n \times m$ matrix formed by the unit vector followed by the first raw coordinates (Sec. 3.3). The above spectral relaxation of the memberships, involving the eigenstructure of the normalized exchange matrix, completely differs from the soft membership relaxation which will be used in Section 3.2, preserving positivity and normalization of Z.



3 Euclidean distances on weighted graphs 

3.1 Squared Euclidean distances 

Consider a collection of n objects together with an associated pairwise distance. A successful clustering consists in partitioning the objects into m groups, such that the average distances between objects belonging to the same (different) group are small (large). The most tractable pairwise distance is, by all means, the squared Euclidean distance $D_{ij} = \sum_{c} (x_{ic} - x_{jc})^2$, where $x_{ic}$ is the coordinate of object i in dimension c. Its virtues follow from Huygens principles

$$\sum_j p_j D_{ij} = D_{ip} + \Delta_p \qquad \Delta_p = \sum_j p_j D_{jp} = \frac{1}{2} \sum_{ij} p_i p_j D_{ij} \qquad (1)$$

where $p_i$ represents a (possibly non positive) signed distribution, i.e. obeying $\sum_i p_i = 1$, $D_{ip}$ is the squared Euclidean distance between i and the centroid of coordinates $x_{pc} = \sum_i p_i x_{ic}$, and $\Delta_p$ the average pairwise distance or inertia. Equations (1) are easily checked using the coordinates, although the latter do not explicitly appear in the formulas. To that extent, squared Euclidean distances enable a feature-free formalism, a property shared with the kernels of Machine Learning: to the "kernel trick" of Machine Learning corresponds an equivalent "distance trick" (Scholkopf 2000; Williams 2002), as expressed by the well-known Classical Multidimensional Scaling (MDS) procedure. Theorem 1 below presents a weighted version (Bavaud 2006), generalizing the uniform MDS procedure (see e.g. Mardia et al. 1979). Historically, MDS has been developed from the independent contributions of Schoenberg (1938b) and Young and Householder (1938). The algorithm has been popularized by Torgerson (1958) in Data Analysis.

Theorem 1 (weighted classical MDS). The dissimilarity square matrix D between n objects with weights p is a squared Euclidean distance iff the scalar product matrix $B = -\frac{1}{2} H D H'$ is (weakly) positive definite (p.d.), where H is the $n \times n$ centering matrix with components $h_{ij} = \delta_{ij} - p_j$. By construction, $B_{ij} = -\frac{1}{2}(D_{ij} - D_{ip} - D_{jp})$ and $D_{ij} = B_{ii} + B_{jj} - 2B_{ij}$. The object coordinates can be reconstructed as $x_{i\beta} = \mu_\beta^{1/2}\, p_i^{-1/2}\, v_{i\beta}$ for $\beta = 1, 2, \ldots$, where the $\mu_\beta$ are the decreasing eigenvalues and the $v_{i\beta}$ are the eigenvectors occurring in the spectral decomposition $K = V M V'$ of the weighted scalar product or kernel K with components $K_{ij} = \sqrt{p_i p_j}\, B_{ij}$. This reconstruction provides the optimal low-dimensional reconstruction of the inertia associated to p

$$\Delta = \frac{1}{2} \sum_{ij} p_i p_j D_{ij} = \mathrm{tr}(K) = \sum_{\beta \ge 1} \mu_\beta .$$

Also, the Euclidean (or not) character of D is independent of the choice of p.
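Theorem 1 translates almost line by line into code. The following sketch assumes a hypothetical helper `weighted_mds` and toy planar points; it is meant to illustrate the construction, not to serve as a reference implementation.

```python
import numpy as np

def weighted_mds(D, p):
    """Weighted classical MDS: eigenvalues mu (decreasing), coordinates X, scalar products B."""
    n = len(p)
    H = np.eye(n) - np.outer(np.ones(n), p)        # h_ij = delta_ij - p_j
    B = -0.5 * H @ D @ H.T
    K = np.sqrt(np.outer(p, p)) * B                # K_ij = sqrt(p_i p_j) B_ij
    mu, V = np.linalg.eigh(K)
    order = np.argsort(mu)[::-1]
    mu, V = mu[order], V[:, order]
    # x_{i beta} = mu_beta^{1/2} p_i^{-1/2} v_{i beta}; negative mu are clipped to zero
    X = V * np.sqrt(np.clip(mu, 0.0, None)) / np.sqrt(p)[:, None]
    return mu, X, B

# Toy configuration (assumed data): four points in the plane with unequal weights
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
p = np.array([0.1, 0.2, 0.3, 0.4])
D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
mu, X, B = weighted_mds(D, p)
D_rec = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
inertia = 0.5 * p @ D @ p
```

Since the points live in the plane, only two eigenvalues are non-zero, the reconstructed distances `D_rec` match `D`, and the sum of the eigenvalues equals the inertia.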



3.2 Thermodynamic clustering 

Consider the overall objects weight $f$, defining a centroid denoted by 0, together with m soft groups defined by their distributions $\pi^g$ for $g = 1, \ldots, m$, with associated centroids denoted by g. By (1), the overall inertia decomposes as

$$\Delta = \sum_i f_i D_{i0} = \sum_{ig} f_i z_{ig} D_{i0} = \sum_g \rho_g \sum_i \pi_i^g D_{i0} = \sum_g \rho_g [D_{g0} + \Delta_g] = \Delta_B + \Delta_W$$

where $\Delta_B[Z] = \sum_g \rho_g D_{g0}$ is the between-groups inertia, and $\Delta_W[Z] = \sum_g \rho_g \Delta_g$ the within-groups inertia. The optimal clustering is then provided by the $n \times m$ membership matrix Z minimizing $\Delta_W[Z]$, or equivalently maximizing $\Delta_B[Z]$. The former functional can be shown to be concave in Z (Bavaud 2009), implying the minimum to be attained for hard clusterings.

Hard clustering is notoriously computationally intractable and some kind of regularization is required. Many authors (see e.g. Huang and Ng (1999) or Filippone et al. (2008)) advocate the use of the c-means clustering, involving a power transform of the memberships. Despite its efficiency and popularity, the c-means algorithm actually suffers from a serious formal defect, questioning its very logical foundations: its objective function is indeed not aggregation-invariant, that is it generally changes when two groups g and h supposed equivalent in the sense $\pi^g = \pi^h$ are merged into a single group $[g \cup h]$ with membership $z_{i[g \cup h]} = z_{ig} + z_{ih}$ (Bavaud 2009).

An alternative, aggregation-invariant regularization is provided by the thermodynamic clustering, minimizing over Z the free energy $F[Z] = \Delta_W[Z] + T\, I[Z]$, where $I[Z] = I(O, Z)$ is the objects-groups mutual information and $T > 0$ the temperature (Rose et al. 1990; Rose 1998; Bavaud 2009). The resulting membership is determined iteratively through

$$z_{ig} = \frac{\rho_g \exp(-D_{ig}/T)}{\sum_{h=1}^{m} \rho_h \exp(-D_{ih}/T)} \qquad (2)$$

where $D_{ig}$ is the squared Euclidean distance between object i and the centroid of group g, and converges towards a local minimum of the free energy. Equation (2) amounts to fitting Gaussian clusters in the framework of model-based clustering.
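The iteration (2) can be sketched without any coordinates, obtaining the object-to-centroid distances from the Huygens principle (1). The function name `thermodynamic_step`, the toy data on a line, the initial membership and the temperature are all assumptions chosen for illustration.

```python
import numpy as np

def thermodynamic_step(D, f, Z, T):
    """One update of the memberships z_ig from equation (2)."""
    rho = np.maximum(f @ Z, 1e-300)                 # group volumes (guarded against zero)
    Pi_g = (f[:, None] * Z) / rho                   # column g holds the distribution pi^g
    DP = D @ Pi_g                                   # sum_j pi_j^g D_ij
    inertia = 0.5 * np.einsum('jg,jg->g', Pi_g, DP) # within-group inertia Delta_g
    D_ig = DP - inertia                             # Huygens: distance to the centroid of g
    logw = np.log(rho) - D_ig / T
    logw -= logw.max(axis=1, keepdims=True)         # numerical stabilization
    Z_new = np.exp(logw)
    return Z_new / Z_new.sum(axis=1, keepdims=True)

# Toy data (assumed): two well separated clumps on a line, uniform weights
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = (x[:, None] - x[None, :]) ** 2
f = np.full(6, 1.0 / 6.0)
Z = np.array([[0.6, 0.4]] * 3 + [[0.4, 0.6]] * 3)   # slightly asymmetric start
for _ in range(50):
    Z = thermodynamic_step(D, f, Z, T=0.5)
```

At this low temperature the memberships become practically hard and separate the two clumps; a perfectly symmetric start would instead remain at the trivial fixed point where all groups coincide.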

3.3 Three nested classes of squared Euclidean distances 

Equation (2) solves the K-way soft graph clustering problem, given of course the availability of a sound class of squared Euclidean distances on weighted graphs. Definitions 2 and 3 below seem to solve the latter issue.

Consider a graph possessing two distinct but equivalent vertices, in the sense that their relative exchange with the other vertices (including themselves) is identical. Those vertices somehow stand as duplicates of the same object, and one could as a first attempt require their distance to be zero.

Definition 1 (Equivalent vertices; focused distances). Two distinct vertices i and j are equivalent, noted $i \sim j$, if $e_{ik}/f_i = e_{jk}/f_j$ for all k. A distance is focused if $D_{ij} = 0$ for $i \sim j$.

Proposition 1. $i \sim j$ iff $x_{i\alpha} = x_{j\alpha}$ for all $\alpha \ge 1$ such that $\lambda_\alpha \ne 0$, where $x_{i\alpha} = u_{i\alpha}/\sqrt{f_i}$ is the raw coordinate of vertex i in dimension $\alpha$.

The proof directly follows from the substitution $e_{ik} \to f_i e_{jk}/f_j$ in the identity $\sum_k f_i^{-1/2} e_{ik} f_k^{-1/2} u_{k\alpha} = \lambda_\alpha u_{i\alpha}$. Note that the condition trivially holds for the trivial eigenvector $\alpha = 0$, in view of $f_i^{-1/2} u_{i0} = 1$ for all i. It also holds trivially for the "completely connected" weighted graph $e_{ij}^{(\infty)} = f_i f_j$, where all vertices are equivalent, and all eigenvalues are zero, except the trivial one.

Hence, any expression of the form $D_{ij} = \sum_{\alpha \ge 1} g_\alpha (f_i^{-1/2} u_{i\alpha} - f_j^{-1/2} u_{j\alpha})^2$ with $g_\alpha \ge 0$ constitutes an admissible squared Euclidean distance, obeying $D_{ij} = 0$ for $i \sim j$, provided $g_\alpha = 0$ if $\lambda_\alpha = 0$. The quantities $g_\alpha$ are non-negative, but otherwise arbitrary; however, it is natural to require the latter to depend upon the sole parameters at disposal, namely the eigenvalues, that is to set $g_\alpha = g(\lambda_\alpha)$.

Definition 2 (Focused and Natural Distances on Weighted Graphs). Let E be the exchange matrix associated to a weighted graph, and define $E^s := \Pi^{-1/2}(E - E^{(\infty)})\Pi^{-1/2}$, the standardized exchange matrix. The class of focused squared Euclidean distances on weighted graphs is

$$D_{ij} = B_{ii} + B_{jj} - 2B_{ij}, \qquad \text{where } B = \Pi^{-1/2} K\, \Pi^{-1/2} \text{ and } K = g(E^s)$$

where $g(\lambda)$ is any non-negative sufficiently regular real function with $g(0) = 0$. Dropping the requirement $g(0) = 0$ defines the more general class of natural squared Euclidean distances on weighted graphs.

If $g(1)$ is finite, K can also be defined as $K = g(\Pi^{-1/2} E\, \Pi^{-1/2}) = U g(\Lambda) U'$.



First, note the standardized exchange matrix to result from a "centering" (eliminating the trivial eigendimension) followed by a "normalization":

$$e^s_{ij} = \frac{e_{ij} - f_i f_j}{\sqrt{f_i f_j}} \qquad (3)$$

Secondly, B is the matrix of scalar products appearing in Theorem 1. The resulting optimal reconstruction coordinates are $\sqrt{g(\lambda_\alpha)}\, x_{i\alpha}$, where the quantities $x_{i\alpha} = f_i^{-1/2} u_{i\alpha}$ are the raw coordinates of vertex i in dimension $\alpha = 1, 2, \ldots$ appearing in Proposition 1, which yields a general rationale for their widespread use in clustering and low-dimensional visualization. Thirdly, the matrix $g(E^s)$ can be defined, for $g(\lambda)$ regular enough, as the power expansion in $(E^s)^t$ with coefficients given by the power expansion of $g(\lambda)$ in $\lambda^t$, for $t = 0, 1, 2, \ldots$ Finally, the two variants of B appearing in Definition 2 are identical up to a matrix $g(1)\, 1_n 1_n'$, leaving D unchanged.
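Definition 2 can be sketched as a single function taking the exchange matrix and the function g. The names `natural_distance` and `split_vertex_graph`, as well as the toy graph with two equivalent vertices, are hypothetical and only serve to illustrate the focused and non-focused cases.

```python
import numpy as np

def natural_distance(E, g):
    """Squared Euclidean distance D from K = g(E^s), for a vectorized function g."""
    f = E.sum(axis=1)
    s = np.sqrt(np.outer(f, f))
    Es = (E - np.outer(f, f)) / s          # standardized exchange matrix
    lam, U = np.linalg.eigh(Es)
    K = (U * g(lam)) @ U.T                 # K = g(E^s) through the eigenstructure
    B = K / s                              # B = Pi^{-1/2} K Pi^{-1/2}
    d = np.diag(B)
    return d[:, None] + d[None, :] - 2 * B

def split_vertex_graph(a=0.3, b=0.2, c=0.3, t=0.25):
    """Toy graph (assumed): vertex 2 of a 2-vertex graph is split into two equivalent vertices."""
    w = np.array([t, 1 - t])
    E = np.zeros((3, 3))
    E[0, 0] = a
    E[0, 1:] = E[1:, 0] = b * w
    E[1:, 1:] = c * np.outer(w, w)
    return E

E = split_vertex_graph()
f = E.sum(axis=1)
D_chi = natural_distance(E, lambda lam: lam ** 2)            # focused: g(0) = 0
D_fro = natural_distance(E, lambda lam: np.ones_like(lam))   # natural: g(0) = 1
```

Vertices 1 and 2 of the toy graph are equivalent: their chi-square distance vanishes, while their "frozen" distance equals $1/f_1 + 1/f_2$, in line with the universal values discussed below.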

If $g(1) = \infty$, the distance between vertices belonging to distinct irreducible components becomes infinite: recall the graph to be disconnected iff $\lambda_1 = 1$. Such distances will be referred to as irreducible.

Natural distances are in general not focused. The distances between equivalent vertices are however universal, that is independent of the details of the graph or of the associated distance (Proposition 2). To demonstrate this property, consider first an equivalence class $J := \{k \mid k \sim j\}$ containing at least two equivalent vertices. Aggregating the vertices in J results in a new $\tilde{n} \times \tilde{n}$ exchange matrix $\tilde{E}$ with $\tilde{n} = n - |J| + 1$, with components $\tilde{e}_{JJ} = \sum_{i,j \in J} e_{ij}$, $\tilde{e}_{Jk} = \tilde{e}_{kJ} = \sum_{j \in J} e_{jk}$ for $k \notin J$ and $\tilde{f}_J = \sum_{j \in J} f_j$, the other components remaining unchanged.

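Proposition 2. Let D be a natural distance and consider a graph possessing an equivalence class J of size $|J| \ge 2$. Consider two distinct elements $i \sim j$ of J and let $k \notin J$. Then

$$D_{ij} = g(0)\left(\frac{1}{f_i} + \frac{1}{f_j}\right) \qquad D_{Jj} = g(0)\left(\frac{1}{f_j} - \frac{1}{\tilde{f}_J}\right) \qquad \Delta_J = g(0)\, \frac{|J| - 1}{\tilde{f}_J} .$$

Moreover, the Pythagorean relation $D_{kj} = D_{kJ} + D_{Jj}$ holds.

Proof: consider the eigenvalues $\tilde{\lambda}_\beta$ and eigenvectors $\tilde{u}_\beta$ associated to the aggregated graph $\tilde{E}$, for $\beta = 0, \ldots, \tilde{n} - 1$. One can check that, due to the collinearity generated by the $|J|$ equivalent vertices,

• $\tilde{n}$ among the original eigenvalues $\lambda_\alpha$ coincide with the set of aggregated eigenvalues $\tilde{\lambda}_\beta$ (non null in general), with corresponding eigenvectors $u_{j\beta} = f_j^{1/2} \tilde{f}_J^{-1/2} \tilde{u}_{J\beta}$ for $j \in J$ and $u_{k\beta} = \tilde{u}_{k\beta}$ for $k \notin J$

• $|J| - 1$ among the original eigenvalues $\lambda_\alpha$ are zero. Their corresponding eigenvectors are of the form $u_{j\gamma} = h_{j\gamma}$ for $j \in J$ and $u_{k\gamma} = 0$ for $k \notin J$, where the $h_\gamma$ constitute the $|J| - 1$ columns of an orthogonal $|J| \times |J|$ matrix, the remaining column being $(f_j^{1/2} \tilde{f}_J^{-1/2})_{j \in J}$.

Identities in Proposition 2 follow by substitution. For instance, the first family of eigenvectors yields identical raw coordinates for $i, j \in J$ and does not contribute to $D_{ij}$, while orthogonality gives $\sum_\gamma h_{i\gamma} h_{j\gamma} = \delta_{ij} - \sqrt{f_i f_j}/\tilde{f}_J$, so that

$$D_{ij} = g(0) \sum_\gamma \left(\frac{h_{i\gamma}}{\sqrt{f_i}} - \frac{h_{j\gamma}}{\sqrt{f_j}}\right)^2 = g(0)\left(\frac{1}{f_i} + \frac{1}{f_j}\right) .$$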

General as it is, the class of squared Euclidean distances on weighted graphs of Definition 2 can still be extended: a wonderful result of Schoenberg (1938a), still apparently little known in the Statistical and Machine Learning community (see however the references in Kondor and Lafferty (2002); Hein et al. (2005)) asserts that the componentwise correspondence $\tilde{D}_{ij} = \varphi(D_{ij})$ transforms any squared Euclidean distance D into another squared Euclidean distance $\tilde{D}$, provided that

i) $\varphi(D)$ is positive with $\varphi(0) = 0$
ii) odd derivatives $\varphi'(D), \varphi'''(D), \ldots$ are positive
iii) even derivatives $\varphi''(D), \varphi''''(D), \ldots$ are negative.

For example, $\varphi(D) = D^a$ (for $0 < a < 1$) and $\varphi(D) = 1 - \exp(-bD)$ (for $b > 0$) are instances of such Schoenberg transformations (Bavaud 2010).
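The effect of these transformations can be checked numerically with the criterion of Theorem 1. The sketch below uses a hypothetical helper `is_squared_euclidean` with uniform weights (the Euclidean character does not depend on the weights) and assumed toy points on a line; the map $D \to D^2$ is included as a transformation violating condition iii).

```python
import numpy as np

def is_squared_euclidean(D, tol=1e-9):
    """Theorem 1 with uniform weights: B = -1/2 H D H' must have no negative eigenvalue."""
    n = D.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    B = -0.5 * H @ D @ H.T
    return np.linalg.eigvalsh(B).min() > -tol

# Toy squared Euclidean distance (assumed data): four equally spaced points on a line
x = np.array([0.0, 1.0, 2.0, 3.0])
D = (x[:, None] - x[None, :]) ** 2

power_ok = is_squared_euclidean(D ** 0.5)               # phi(D) = D^a with a = 0.5
gauss_ok = is_squared_euclidean(1.0 - np.exp(-0.3 * D)) # phi(D) = 1 - exp(-bD), b = 0.3
square_ok = is_squared_euclidean(D ** 2)                # not a Schoenberg transformation
```

The first two transformed dissimilarities remain squared Euclidean, whereas squaring D breaks the triangle inequality of the underlying distances and fails the test.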

Definition 3 (Extended Distances on Weighted Graphs). The class of extended squared Euclidean distances on weighted graphs is

$$\tilde{D}_{ij} = \varphi(D_{ij})$$

where $\varphi(D)$ is a Schoenberg transformation (as specified above), and $D_{ij}$ is a natural squared Euclidean distance associated to the weighted graph E, in the sense of Definition 2.

3.4 Examples of distances on weighted graphs 

The chi-square distance. The choice $g(\lambda) = \lambda^2$ entails, together with (3),

$$\Delta = \mathrm{tr}(K) = \mathrm{tr}((E^s)^2) = \sum_{ij} \frac{(e_{ij} - f_i f_j)^2}{f_i f_j} = \chi^2$$

which is the familiar chi-square measure of the overall rows-columns dependency in a (square) contingency table, with distance $D^{\chi}_{ij} = \sum_k f_k^{-1}(f_i^{-1} e_{ik} - f_j^{-1} e_{jk})^2$, well-known in the Correspondence Analysis community (Lafon and Lee 2006; Greenacre 2007 and references therein). Note that $D^{\chi}_{ij} = 0$ for $i \sim j$, as it must.

The diffusive distance. The choice $g(\lambda) = \lambda$ is legitimate, provided the exchange matrix is purely diffusive, that is p.d. Such are typically the graphs resulting from inter-regional migrations (Sec. 4) or social mobility tables (Bavaud 2008). As most people do not change place or status during the observation time, the exchange matrix is strongly dominated by its diagonal, and hence p.d.

Positive definiteness also occurs for graphs defined from the affinity matrix $\exp(-\beta D_{ij})$ (Gaussian kernel), as in Belkin and Niyogi (2003), among many others. Indeed, distances derived from the Gaussian kernel provide a prototypical example of Schoenberg transformation (see Definition 3). By contrast, the affinity $I(D_{ij} \le \epsilon^2)$ used by Tenenbaum et al. (2000) is not p.d.

The corresponding distance, together with the inertia, plainly read

$$D^{\mathrm{dif}}_{ij} = \frac{e_{ii}}{f_i^2} + \frac{e_{jj}}{f_j^2} - 2\,\frac{e_{ij}}{f_i f_j} \qquad \Delta^{\mathrm{dif}} = \sum_i \frac{e_{ii}}{f_i} - 1$$



The "frozen" distance. The choice $g(\lambda) = 1$ produces, for any graph, a result identical to the application of any function $g(\lambda)$ (with $g(1) = 1$) to the purely diagonal "frozen" graph $E^{(0)} = \Pi$, namely (compare with Proposition 2):

$$D^{\mathrm{fro}}_{ij} = \frac{1}{f_i} + \frac{1}{f_j} \quad (i \ne j) \qquad \Delta^{\mathrm{fro}} = n - 1$$

This "star-like" distance (Critchley and Fichet 1994) is embeddable in a tree.

The average commute time distance. The choice $g(\lambda) = (1 - \lambda)^{-1}$ corresponds to the average commute time distance; see Fouss et al. (2007) for a review and recent results. The amazing fact that the latter constitutes a squared Euclidean distance has only recently been explicitly recognized as such, although the key ingredients had been available since the seventies.

Let us sketch a derivation of this result: on one hand, consider a random walk on the graph with probability transition matrix $P = \Pi^{-1} E$, and let $T_j$ denote the first time the chain hits state j. The average time to go from i to j is $m_{ij} = E_i(T_j)$, with $m_{ii} = 0$, where $E_i(.)$ denotes the expectation for a random walk started in i. Considering the state following i yields for $i \ne j$ the relation $m_{ij} = 1 + \sum_k p_{ik} m_{kj}$, with solution (Kemeny and Snell (1976); Aldous and Fill, draft chapters) $m_{ij} = (y_{jj} - y_{ij})/f_j$, where $Y = \Pi^{-1} \sum_{t \ge 0} (E^{(t)} - E^{(\infty)}) = (E^{(0)} - E + E^{(\infty)})^{-1} \Pi$ is the so-called fundamental matrix of the Markov chain. On the other hand, Definition 2 yields $K = (I - E^s)^{-1} = \Pi^{1/2} (E^{(0)} - E + E^{(\infty)})^{-1} \Pi^{1/2} = \Pi^{1/2}\, Y\, \Pi^{-1/2}$, and thus $B = Y \Pi^{-1} = \Pi^{-1} Y'$. Hence

$$D^{\mathrm{com}}_{ij} = B_{ii} + B_{jj} - 2B_{ij} = \frac{y_{ii}}{f_i} + \frac{y_{jj}}{f_j} - \frac{y_{ij}}{f_j} - \frac{y_{ji}}{f_i} = m_{ij} + m_{ji}$$

which is the average time to go from i to j and back to i, as announced.
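The identity between the spectral definition and the random walk interpretation can be checked on a small graph. In the sketch below, the helper names `commute_time_distance` and `mean_first_passage` and the toy exchange matrix are assumptions; first passage times are obtained by solving the linear relation quoted above for each target state.

```python
import numpy as np

def commute_time_distance(E):
    """D^com from Definition 2 with g(lambda) = 1 / (1 - lambda); requires a connected graph."""
    f = E.sum(axis=1)
    s = np.sqrt(np.outer(f, f))
    Es = (E - np.outer(f, f)) / s
    K = np.linalg.inv(np.eye(len(f)) - Es)
    B = K / s
    d = np.diag(B)
    return d[:, None] + d[None, :] - 2 * B

def mean_first_passage(E):
    """m_ij solving m_ij = 1 + sum_k p_ik m_kj for i != j, with m_jj = 0."""
    f = E.sum(axis=1)
    P = E / f[:, None]
    n = len(f)
    M = np.zeros((n, n))
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        A = np.eye(n - 1) - P[np.ix_(keep, keep)]
        M[keep, j] = np.linalg.solve(A, np.ones(n - 1))
    return M

# Toy connected exchange matrix with loops (assumed data)
E = np.array([[4.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]]) / 13.0
D_com = commute_time_distance(E)
M = mean_first_passage(E)
```

The symmetric matrix `M + M.T` of round-trip times then coincides with `D_com` up to rounding.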

Consider, for future use, the Dirichlet form $\mathcal{E}(y) = \frac{1}{2} \sum_{ij} e_{ij} (y_i - y_j)^2$, and denote by $y^0$ the solution of the "electrical" problem $\min_{y \in C_{ij}} \mathcal{E}(y)$, where $C_{ij}$ denotes the set of vectors y such that $y_i = 1$ and $y_j = 0$. Then $y^0_k = P_k(T_i < T_j)$, where $P_k(.)$ denotes the probability for a random walk started at k. Then $D^{\mathrm{com}}_{ij} = 1/\mathcal{E}(y^0)$ (Aldous and Fill, chapter 3).

The shortest-path distance. Let $\Gamma_{ij}$ be the set of paths with extremities i and j, where a path $\gamma \in \Gamma_{ij}$ consists of a succession of consecutive unrepeated edges denoted by $\alpha = (k, l) \in \gamma$, whose weights $e_\alpha$ represent conductances. Their inverses are resistances, whose sum is to be minimized by the shortest path $\gamma^0 \in \Gamma_{ij}$ (not necessarily unique) on the weighted graph E. This setup generalizes the unweighted graphs framework, and defines the shortest path distance

$$D^{\mathrm{sp}}_{ij} = \min_{\gamma \in \Gamma_{ij}} \sum_{\alpha \in \gamma} \frac{1}{e_\alpha}$$

We believe the following result to be new, although its proof simply combines a classical result published in the fifties (Beurling and Deny 1958) with the above "electrical" characterization of the average commute time distance.

Proposition 3. $D^{\mathrm{sp}}_{ij} \ge D^{\mathrm{com}}_{ij}$, with equality for all i, j iff E is a weighted tree.



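Proof: let $\gamma^0 \in \Gamma_{ij}$ be the shortest-path between i and j. Consider a vector y and define $dy_\alpha = y_k - y_l$ for an edge $\alpha = (k, l)$. Then

$$|y_i - y_j| \overset{(a)}{\le} \sum_{\alpha \in \gamma^0} |dy_\alpha| = \sum_{\alpha \in \gamma^0} \frac{1}{\sqrt{e_\alpha}} \sqrt{e_\alpha}\, |dy_\alpha| \overset{(b)}{\le} \sqrt{\sum_{\alpha \in \gamma^0} \frac{1}{e_\alpha}}\; \sqrt{\sum_{\alpha \in \gamma^0} e_\alpha (dy_\alpha)^2} \overset{(c)}{\le} \sqrt{D^{\mathrm{sp}}_{ij}}\; \sqrt{\mathcal{E}(y)}$$

Hence $D^{\mathrm{sp}}_{ij} \ge (y_i - y_j)^2/\mathcal{E}(y)$ for all y, in particular for $y^0$ defined above, showing $D^{\mathrm{sp}}_{ij} \ge D^{\mathrm{com}}_{ij}$. Equality holds iff (a) $y^0$ is monotonically decreasing along the path $\gamma^0$, (b) for all $\alpha \in \gamma^0$, $dy^0_\alpha = c/e_\alpha$ for some constant c, and (c) $dy^0_\alpha e_\alpha = 0$ for all $\alpha \notin \gamma^0$. (b), expressing Ohm's law U = RI in the electrical analogy, holds for $y^0$, and (a) and (c) hold for a tree, that is a graph possessing no closed path.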
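Proposition 3 can be illustrated on two toy graphs. The sketch below assumes hypothetical helpers `shortest_path_distance` (a Floyd-Warshall pass on the resistances $1/e_{ij}$) and `commute_time_distance` (Definition 2 with $g(\lambda) = 1/(1-\lambda)$), and two assumed exchange matrices: a weighted path, which is a tree, and a weighted triangle.

```python
import numpy as np

def shortest_path_distance(E):
    """Minimal sum of resistances 1/e_ij along paths (Floyd-Warshall); loops are ignored."""
    n = E.shape[0]
    R = np.full((n, n), np.inf)
    off = (E > 0) & ~np.eye(n, dtype=bool)
    R[off] = 1.0 / E[off]
    np.fill_diagonal(R, 0.0)
    for k in range(n):
        R = np.minimum(R, R[:, [k]] + R[[k], :])
    return R

def commute_time_distance(E):
    """D^com from Definition 2 with g(lambda) = 1 / (1 - lambda)."""
    f = E.sum(axis=1)
    s = np.sqrt(np.outer(f, f))
    Es = (E - np.outer(f, f)) / s
    B = np.linalg.inv(np.eye(len(f)) - Es) / s
    d = np.diag(B)
    return d[:, None] + d[None, :] - 2 * B

# Weighted path 0 - 1 - 2 (a tree) and weighted triangle; both sum to one
E_tree = np.array([[0.0, 0.3, 0.0],
                   [0.3, 0.0, 0.2],
                   [0.0, 0.2, 0.0]])
E_tri = np.array([[0.0, 0.2, 0.1],
                  [0.2, 0.0, 0.2],
                  [0.1, 0.2, 0.0]])
sp_tree, com_tree = shortest_path_distance(E_tree), commute_time_distance(E_tree)
sp_tri, com_tri = shortest_path_distance(E_tri), commute_time_distance(E_tri)
```

On the tree both distances coincide, while on the triangle the alternative route lowers the commute time distance strictly below the shortest-path distance.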

The shortest-path distance is unfocused and irreducible. Seeking to determine the corresponding function $g(\lambda)$ involved in Definition 2 and/or the Schoenberg transformation $\varphi(D)$ involved in Definition 3 is however hopeless:

Proposition 4. $D^{\mathrm{sp}}$ is not a squared Euclidean distance.

Proof: a counter-example is provided (Deza and Laurent (1997) p. 83) by the complete bipartite graph $K_{2,3}$ of Figure 1:

$$E = \frac{1}{12}\begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \end{pmatrix} \qquad D^{\mathrm{sp}} = 12 \begin{pmatrix} 0 & 2 & 1 & 1 & 1 \\ 2 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 2 & 2 \\ 1 & 1 & 2 & 0 & 2 \\ 1 & 1 & 2 & 2 & 0 \end{pmatrix}$$

Fig. 1. Bipartite graph $K_{2,3}$, associated exchange matrix and shortest-path distance

The eigenvalues occurring in Theorem 1 are $\mu_1 = 3$, $\mu_2 = 2.32$, $\mu_3 = 2$, $\mu_4 = 0$ and $\mu_5 = -0.49$, thus ruling out the possible squared Euclidean nature of $D^{\mathrm{sp}}$.
The absorption distance. The choice $g(\lambda) = (1 - \rho)/(1 - \rho\lambda)$ where $0 < \rho < 1$ yields the absorption distance: consider a modified random walk, where, at each discrete step, a particle at i either undergoes with probability $\rho$ a transition $i \to j$ (with probability $p_{ij}$) or is forever absorbed with probability $1 - \rho$ into some additional "cemetery" state. The quantities $v_{ij}(\rho)$ = "average number of visits from i to j before absorption" obtain as the components of the matrix (see e.g. Kemeny and Snell (1976) or Kijima (1997))

$$V(\rho) = (I - \rho P)^{-1} = (\Pi - \rho E)^{-1} \Pi \qquad \text{with } f_i v_{ij} = f_j v_{ji} \text{ and } \sum_i f_i v_{ij} = \frac{f_j}{1 - \rho} .$$

Hence $K = g(\Pi^{-1/2} E\, \Pi^{-1/2}) = (1 - \rho)\, \Pi^{1/2}\, V\, \Pi^{-1/2}$ and $B_{ij} = (1 - \rho) v_{ij}/f_j$, measuring the ratio of the average number of visits from i to j over its expected value over the initial state i. Finally,

$$D^{\mathrm{abs}}_{ij}(\rho) = (1 - \rho)\left(\frac{v_{ii}}{f_i} + \frac{v_{jj}}{f_j} - 2\,\frac{v_{ij}}{f_j}\right)$$

By construction, $\lim_{\rho \to 0} D^{\mathrm{abs}}(\rho) = D^{\mathrm{fro}}$ and $\lim_{\rho \to 1} (1 - \rho)^{-1} D^{\mathrm{abs}}(\rho) = D^{\mathrm{com}}$. Also, $\lim_{\rho \to 1} D^{\mathrm{abs}}(\rho) = 0$ for a connected graph.

The "sif" distance. The choice $g(\lambda) = \lambda^2/(1 - \lambda)$ is the simplest one ensuring an irreducible and focused squared Euclidean distance. The identity $\lambda^2/(1 - \lambda) = 1/(1 - \lambda) - \lambda - 1$ readily yields (whether $D^{\mathrm{dif}}$ is Euclidean or not)

$$D^{\mathrm{sif}}_{ij} = D^{\mathrm{com}}_{ij} - D^{\mathrm{dif}}_{ij} - D^{\mathrm{fro}}_{ij}$$

4 Numerical experiments 

4.1 Inter-cantonal migration data 

The first data set consists of the numbers $N = (n_{ij})$ of people inhabiting the Swiss canton i in 1980 and the canton j in 1985 ($i, j = 1, \ldots, n = 26$), with a total count of 6'039'313 inhabitants, 93% of which are distributed over the diagonal. N can be made brutally symmetric as $\frac{1}{2}(n_{ij} + n_{ji})$ or $\sqrt{n_{ij} n_{ji}}$, or, more gently, by fitting a quasi-symmetric model (Bavaud 2002), as done here. Normalizing the maximum likelihood estimate yields the exchange matrix E. Raw coordinates $x_{i\alpha} = u_{i\alpha}/\sqrt{f_i}$ are depicted in Figure 2. By construction, they do not depend on the form of the function $g(\lambda)$ involved in Definition 2, but they do depend on the form of the Schoenberg transformation $\tilde{D} = \varphi(D)$ involved in Definition 3, where they obtain as solutions of the weighted MDS algorithm (Theorem 1) on $\tilde{D}$, with unchanged weights $f$ (Figure 3 (a) and (b)).

Fig. 2. Proportion of Swiss-German speakers in the 26 Swiss cantons (left), and raw coordinates $x_{i\alpha}$ associated to the inter-cantonal migrations, in dimensions $\alpha = 1, 2$ (center) and $\alpha = 3, 4$ (right). Colours code the linguistic regions, namely: 1 = German, 2 = mainly German, 3 = mainly French, 4 = French and 5 = Italian. The central factorial map reconstructs fairly precisely the geographical map, and emphasizes the linguistic German-French barrier, known as "Rostigraben". The linguistic isolation of the sole Italian-speaking canton, intensified by the Alpine barrier, is patent.

Iterating (2) from an initial $n \times m$ membership matrix $Z_{\mathrm{init}}$ (with $m \le n$) at fixed T yields a membership $Z_0(T)$, which is by construction a local minimizer of the free energy $F[Z, T]$. The number $M(Z_0) \le m$ of independent columns of $Z_0$ measures the number of effective groups: equivalent groups, that is groups whose columns are proportional, could and should be aggregated, thus resulting in M distinct groups, without changing the free energy, since both the intra-group dispersion and the mutual information are aggregation-invariant (Bavaud 2009). In practice, groups g and h are judged as equivalent if their relative overlap (Section 2.2) obeys $\theta_{gh}/\sqrt{\theta_{gg}\theta_{hh}} \ge 1 - \epsilon$, with $\epsilon$ a very small power of ten.

Define the relative temperature as $T_{\mathrm{rel}} = T/\Delta$. One expects M = 1 for $T_{\mathrm{rel}} \gg 1$, and M = n for $T_{\mathrm{rel}} \ll 1$, provided of course that the initial membership matrix contains at least n columns. We operate a soft hierarchical descendant clustering scheme, consisting in starting with the identity membership $Z_{\mathrm{init}} = I$ for some $T_{\mathrm{rel}} \ll 1$, iterating (2) until convergence, and then aggregating the equivalent columns in $Z_0(T)$ into M effective groups. The temperature is then slightly increased, and, choosing the resulting optimum $Z_0(T)$ as the new initial membership, (2) is iterated again, and so forth until the emergence of a single effective group (M = 1) in the high temperature phase $T_{\mathrm{rel}} > 1$.
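The aggregation step of this scheme can be sketched as follows. The function name `aggregate_equivalent_groups`, the tolerance `eps` and the toy membership are assumptions; in particular the value of `eps` is a placeholder, not the threshold used in the experiments, and every column is assumed to have a positive volume.

```python
import numpy as np

def aggregate_equivalent_groups(Z, f, eps=1e-10):
    """Merge columns of Z whose relative overlap theta_gh / sqrt(theta_gg theta_hh) is >= 1 - eps."""
    Theta = Z.T @ (f[:, None] * Z)
    d = np.sqrt(np.diag(Theta))
    C = Theta / np.outer(d, d)                 # relative overlaps
    m = Z.shape[1]
    label = -np.ones(m, dtype=int)
    n_groups = 0
    for g in range(m):
        if label[g] >= 0:
            continue                           # already merged into an earlier group
        label[g] = n_groups
        for h in range(g + 1, m):
            if label[h] < 0 and C[g, h] >= 1.0 - eps:
                label[h] = n_groups
        n_groups += 1
    Z_agg = np.zeros((Z.shape[0], n_groups))
    for g in range(m):
        Z_agg[:, label[g]] += Z[:, g]          # merged membership: sum of the memberships
    return Z_agg

# Toy membership (assumed data): the second column is twice the first one
f = np.array([0.5, 0.2, 0.3])
Z = np.array([[0.2, 0.4, 0.4],
              [0.1, 0.2, 0.7],
              [0.3, 0.6, 0.1]])
Z_agg = aggregate_equivalent_groups(Z, f)
M = Z_agg.shape[1]
```

Here the two proportional columns are merged, leaving M = 2 effective groups whose memberships still sum to one for each object.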

Numerical experiments (Figure 3) actually conform to the above expectations, yet with an amazing propensity for tiny groups $\rho_g \ll 1$ to survive at high temperature, that is before being aggregated in the main component. This metastable behaviour is related to the locally optimal nature of the algorithm; presumably unwanted in practical applications, it can be eliminated by forcing group coalescence if, for instance, $H(Z)$ or $F[Z] - \Delta$ become small enough.

The softness measure of the clustering $H(Z|O)$ is expected to be zero in both temperature limits, since both the identity matrix and the single-group membership matrix are hard. We have attempted to measure the quality of the clustering Z with respect to the regional classification R of Figure 2 by the "variation of information" index $H(Z) + H(R) - 2I(Z, R)$ proposed by Meila (2005). Further investigations, beyond the scope of this paper, are obviously still to be conducted in this direction.

[Figure 3: six panels (a) to (f). Recoverable axis labels: raw coordinates, dimension 1; relative temperature $T_{\mathrm{rel}}$; within-group dispersion $\Delta_W$.]

Fig. 3. Raw coordinates extracted from weighted MDS after applying Schoenberg transformations $\tilde{D} = \varphi(D)$ of the power form $\varphi(D) = D^a$ (a), and of the form $\varphi(D) = 1 - \exp(-bD)$ with b inversely proportional to four times the inertia (b). Decrease of the number of effective groups with the temperature (c); beside the main component, two microscopic groups ($\rho_2, \rho_3 \ll 1$) survive at $T_{\mathrm{rel}} = 2$. (d) is the so-called rate-distortion function of Information Theory; its discontinuity at $T_{\mathrm{rel}} = 0.406$ betrays a phase transition between a cold regime with numerous clusters and a hot regime with few clusters (Rose et al. 1990; Bavaud 2009). Behaviour of the overall softness $H(Z|O)$ (e) (Section 2.2) and of the clusters-regions variation of information (f) (see text).

The stability of the effective number of clusters around $T_{\mathrm{rel}} = 1$ might encourage the choice of the solution with M = 7 clusters. Rather disappointingly, the latter turns out (at $T_{\mathrm{rel}} = 0.8$, things becoming even worse at higher temperature) to consist of one giant main component of $\rho_1 > 0.97$, together with 6 other practically single-object groups (UR, OW, NW, GL, AI, JU), totalling less than three percent of the total mass (see also Section 5).

4.2 Commuters data 

The second data set counts the number of commuters $N = (n_{ij})$ between the n = 892 French speaking Swiss communes, living in commune i and working in commune j in 2000. A total of 733'037 people are involved, 49% of which are distributed over the diagonal. As before, the exchange matrix E is obtained after fitting a quasi-symmetric model to N. The first two dimensions $\alpha = 1, 2$ of the raw coordinates $x_{i\alpha} = u_{i\alpha}/\sqrt{f_i}$ are depicted in Figure 4 a). The objects cloud consists of all the communes (up, left) except a single one (down, right), namely "Roche d'Or" (JU), containing 15 active inhabitants, 13 of which work in Roche d'Or. Both the very high value of the proportion of stayers $e_{ii}/f_i$ and the low value of the weight $f_i$ make Roche d'Or (together with other communes, to a lesser extent) quasi-disconnected from the rest of the system, hence producing, in accordance with the theory, eigenvalues as high as $\lambda_1 = .989$, $\lambda_2 = .986$, ..., $\lambda_{30} > .900$ ...

[Figure 4: six panels (a) to (f). Recoverable panel labels: unmodified raw coordinates, dimension 1; raw coordinates from the diagonal-free exchange matrix, dimensions 1 and 3; raw coordinates from D, dimension 1.]

Fig. 4. Raw coordinates associated to the unmodified exchange matrix E are unable to approximate the geographical map (a), in contrast to (b), (c) and (d), based upon the diagonal-free exchange matrix $\check{E}$. Colours code the cantons, namely BE=brown, FR=black, GE=orange, JU=violet, NE=blue, VD=green, VS=red. In particular, the central position of VD (compare with Figure 2) is confirmed. (e) and (f) represent the low-dimensional coordinates obtained by MDS from $D^{\mathrm{jump}}$ (4).

Theoretically flawless as it might be, this behavior stands as a complete geographical failure. As a matter of fact, commuters (and migration)-based graphs are young, that is E is much closer to its short-time limit $E^{(0)}$ than to its equilibrium value $E^{(\infty)}$. Consequently, diagonal components are huge and equivalent vertices in the sense of Definition 1 cannot exist: for $k = i \sim j$, the proportion of stayers $e_{ii}/f_i$ is large, while $e_{ij}/f_j$ is not.

Attempting to consider the Laplacian $E - E^{(0)}$ instead of E does not improve the situation: both matrices indeed generate the same eigenstructure, keeping the order of eigenvalues unchanged. A brutal, albeit more effective strategy consists in plainly destroying the diagonal exchanges, that is in replacing E by the diagonal-free exchange matrix $\check{E}$, with components and associated weights

$$\check{e}_{ij} = \frac{e_{ij} - \delta_{ij} e_{ii}}{1 - \sum_k e_{kk}} \qquad \check{f}_i = \frac{f_i - e_{ii}}{1 - \sum_k e_{kk}}$$

Defining $\check{E}$ as the new exchange matrix yields (Sections 2 and 3) new weights $\check{f}$, eigenvectors $\check{U}$, eigenvalues $\check{\Lambda}$ (with An = 0), raw coordinates $\check{X}$ and distances $\check{D}$, as illustrated in Figure 4 b), c) and d).
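Removing the diagonal and renormalizing takes two lines. The helper name `diagonal_free` and the toy matrix below are illustrative assumptions.

```python
import numpy as np

def diagonal_free(E):
    """Drop the diagonal exchanges and renormalize so that the components add up to unity."""
    off = E - np.diag(np.diag(E))
    E_check = off / off.sum()            # off.sum() equals 1 - sum_k e_kk
    f_check = E_check.sum(axis=1)        # new vertex weights
    return E_check, f_check

# Toy exchange matrix dominated by its diagonal (assumed data)
E = np.array([[0.30, 0.04, 0.01],
              [0.04, 0.35, 0.02],
              [0.01, 0.02, 0.21]])
f = E.sum(axis=1)
E_check, f_check = diagonal_free(E)
```

The new weights `f_check` agree with $(f_i - e_{ii})/(1 - \sum_k e_{kk})$, so that vertices with many stayers lose weight relative to the others.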

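However, an example of equivalent nodes in the sense of Definition 1 is still unlikely to be found, since $0 = \check{e}_{ii}/\check{f}_i \ne \check{e}_{ij}/\check{f}_j$ in general. A weaker concept of equivalence consists in comparing $i \ne j$ by means of their transition probabilities towards the other vertices $k \ne i, j$, that is by means of the Markov chain conditioned to the event that the next state is different. Such Markov transitions approximate the so-called jump process, if existing (see e.g. Kijima (1997) or Bavaud (2008)). Their associated exchange matrix is precisely given by $\check{E}$.

Definition 4 (Weakly equivalent vertices; weakly focused distances). Two distinct vertices i and j are weakly equivalent, noted $i \overset{w}{\sim} j$, if $\check{e}_{ik}/\check{f}_i = \check{e}_{jk}/\check{f}_j$ for all $k \ne i, j$. A distance is weakly focused if $D_{ij} = 0$ whenever $i \overset{w}{\sim} j$.

By construction, the following "jump" distance is squared Euclidean and weakly focused:

$$D^{\mathrm{jump}}_{ij} = \sum_{k \ne i,j} \check{f}_k \left(\frac{\check{e}_{ik}}{\check{f}_i \check{f}_k} - \frac{\check{e}_{jk}}{\check{f}_j \check{f}_k}\right)^2 = \sum_k \frac{1}{\check{f}_k}\left(\frac{\check{e}_{ik}}{\check{f}_i} - \frac{\check{e}_{jk}}{\check{f}_j}\right)^2 - \frac{\check{e}_{ij}^2}{\check{f}_i \check{f}_j}\left(\frac{1}{\check{f}_i} + \frac{1}{\check{f}_j}\right) \qquad (4)$$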

The restriction $k \ne i, j$ in (4) complicates the expression of $D^{\mathrm{jump}}$ in terms of the eigenstructure $(\check{U}, \check{\Lambda})$, and the existence of raw coordinates $\check{x}_{i\alpha}$, adapted to the diagonal-free case and justified by an analog of Proposition 1, remains open. In any case, jump distances (4) are well defined, and yield low-dimensional coordinates of the 892 communes by weighted MDS (Theorem 1) with weights $\check{f}$, as illustrated in Figure 4 e) and f).




[Figure 5: three panels (a) to (c). Recoverable axis labels: relative temperature $T_{\mathrm{rel}}$; within-group dispersion $\Delta_W$.]



Fig. 5. Comparison between the clustering obtained from D''^ (in red) and D^'''^^ 
(in green): evolution of the number of effective clusters with the temperature 
(a), rate-distortion function (b) and overall softness measure (c). In (b), A^""""^ 
has been multiplied by a factor five to fit to the scale. 



5 Conclusion 

Our first numerical results confirm the theoretical coherence and the tractability of the clustering procedure presented in this paper. Yet, further investigations are certainly required: in particular, the precise role that the diagonal components of the exchange matrix should play in the construction of distances on graphs deserves to be thoroughly elucidated. Also, the presence of fairly small clusters in the clustering solutions of Section 4, which the normalized cut algorithm Ncut was supposed to prevent, should be fully understood. Our present guess is that small clusters are inherent to the spatial nature of the data under consideration: elongated and connected clouds such as those of Figure 4 cannot miraculously split into well-distinct groups, irrespective of the details of the clustering algorithm (classical chaining problem). This being said, squared Euclidean distances are closed under addition and convex mixtures. Hence, an elementary yet principled remedy could simply consist in adding spatial squared Euclidean distances to the flow-induced distances investigated in the present contribution.